DS 6030 | Fall 2021 | University of Virginia
NOTE TO GRADER: I collaborated with Geoff Hansen on problem 2, and referenced Class 16 Notes for all problems. I often used starter code from these class notes as well.
data.dir = 'https://mdporter.github.io/DS6030/data/' # data directory
library(R6030) # functions for DS-6030
library(tidyverse) # functions for data manipulation
library(mclust) # functions for mixture models
library(mixtools) # poisregmixEM() function
RFM analysis is an approach that some businesses use to understand their customers’ activities. At any point in time, a company can measure how recently a customer purchased a product (Recency), how many times they purchased a product (Frequency), and how much they have spent (Monetary Value). There are many ad-hoc attempts to segment/cluster customers based on the RFM scores (e.g., here is one based on using the customers’ rank of each dimension independently: https://joaocorreia.io/blog/rfm-analysis-increase-sales-by-segmenting-your-customers.html). In this problem you will use the clustering methods we covered in class to segment the customers.
The data for this problem can be found here: https://mdporter.github.io/DS6030/data//RFM.csv. Cluster based on the Recency, Frequency, and Monetary value columns.
## Read the data
file=file.path(data.dir, "RFM.csv")
df=read_csv(file)
#> Rows: 10000 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (4): id, Recency, Frequency, Monetary
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df=select(df, -id)
## Visualize
plot(df$Recency) # quick univariate look at one feature
## Model-based clustering: recency, frequency, and monetary
cluster1 = Mclust(df,verbose=FALSE)
summary(cluster1)
#> ----------------------------------------------------
#> Gaussian finite mixture model fitted by EM algorithm
#> ----------------------------------------------------
#>
#> Mclust EVV (ellipsoidal, equal volume) model with 9 components:
#>
#> log-likelihood n df BIC ICL
#> -151784 10000 81 -304313 -314488
#>
#> Clustering table:
#> 1 2 3 4 5 6 7 8 9
#> 515 2472 1394 2530 1375 167 952 148 447
plot(cluster1, what="density")
I scaled the data features (mean = 0, SD = 1) so that the monetary value feature didn't dominate the linkage and distance metrics. For the distance metric, I used Euclidean distance, since it is the easiest to interpret for a layperson. Although Euclidean distance can be problematic for a count-like feature such as frequency on its own, I believe it is a useful metric for monetary value and recency when segmenting shoppers. I used complete linkage, which I view as a more conservative clustering approach, primarily because it produced a more evenly balanced dendrogram whose marginal height increases slow after height 3. I cut the dendrogram at height 4.775, yielding five clusters (K = 5), because my elbow plot suggested only marginal decreases in height for additional clusters beyond that point. Given this segmentation, customers 1 and 100 are not in the same cluster (customer 1 is in cluster 1, while customer 100 is in cluster 2).
## Scaling and distance
df_scaled=scale(df) # Sd=1, mean of 0
## Hierarchical cluster
dX=dist(df_scaled, method="euclidean")
hc=hclust(dX, method="complete")
plot(hc)
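To make the chosen cut easier to see, the cut height of 4.775 discussed above can be marked directly on the dendrogram (a small base-R sketch using the `hc` object fitted in the chunk above):

```r
# Redraw the dendrogram and mark the cut height that yields K = 5 clusters
plot(hc, labels = FALSE, hang = -1)
abline(h = 4.775, col = "red", lty = 2)
```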
## Elbow plot
tibble(height = hc$height, K = row_number(-height)) %>%
  ggplot(aes(K, height)) +
  geom_line() +
  geom_point() +
  coord_cartesian(xlim = c(1, 50), ylim = c(0, 15))
## Find the height at which cutting gives K = 5
elbow = tibble(height = hc$height,
               K = row_number(-height))
elbow[(elbow$K == 5), ]
## Find membership vector and compare membership (K) of 100th and 1st customer in dataset.
yhat = cutree(hc, k=5)
result = tibble(membership=yhat)
result[100,1]==result[1,1]
Again, I scaled the data features (mean = 0, SD = 1) so that the monetary value feature didn't dominate the clustering process. I ran k-means clustering for K = 1 to 25. I chose four clusters (K = 4) because the decrease in SSE beyond that point was marginal, striking a balance between variance and bias. Customers 1 and 100 were not in the same cluster using k-means.
## K means
#-- Run kmeans for multiple K
Kmax = 25
SSE = numeric(Kmax)
for(k in 1:Kmax){
km = kmeans(df_scaled, centers=k, nstart=25)
SSE[k] = km$tot.withinss
}
#-- Plot results
plot(1:Kmax, SSE, type='o', las=1, xlab="K")
## Customer comparison
km = kmeans(df_scaled, centers=4, nstart=25)
result=tibble(membership=km$cluster)
result[100,1]==result[1,1]
Again, I scaled the data features (mean = 0, SD = 1) so that the monetary value feature didn't dominate the clustering process. The optimal number of components was nine, according to the mclust package, which selects the model and component count that maximize BIC (mclust's log-likelihood-based criterion). The best model was an "EVV" model, which corresponds to ellipsoidal components with equal volume, variable shape, and variable orientation. Using this segmentation, customers 1 and 100 were not in the same cluster.
I think my choice of cluster modeling approach would depend on the type of customer data I was analyzing, as well as the goal of my analysis. For rapid decision making on highly discrete data, I would likely use k-means clustering or hierarchical clustering due to their hard classification approach. For more continuous data, such as churn rates, and in less time constrained environments, I would likely prefer model-based clustering, implementing the EM algorithm for softer classifications of my customer data.
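As an illustration of the soft assignments mentioned above, an Mclust fit stores each observation's posterior component probabilities in its `z` component (a sketch using the `cluster1` fit from earlier):

```r
# Posterior membership probabilities: each row sums to 1, so every customer
# gets a graded (soft) assignment across the mixture components
round(head(cluster1$z), 3)
# The hard classification is just the maximum a posteriori component
head(apply(cluster1$z, 1, which.max))
```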
plot(cluster1, what="classification")
The pmf of a Poisson random variable is: \[\begin{align*} f_k(x; \lambda_k) = \frac{\lambda_k^x e^{-\lambda_k}}{x!} \end{align*}\]
A two-component Poisson mixture model can be written: \[\begin{align*} f(x; \theta) = \pi \frac{\lambda_1^x e^{-\lambda_1}}{x!} + (1-\pi) \frac{\lambda_2^x e^{-\lambda_2}}{x!} \end{align*}\]
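The mixture pmf above translates directly into a small helper function (a sketch; the argument `p` plays the role of \(\pi\) in the formula):

```r
# f(x; theta): two-component Poisson mixture pmf
# p = mixing weight of component 1, lambda = vector of the two rates
dpois_mix = function(x, p, lambda) {
  p * dpois(x, lambda[1]) + (1 - p) * dpois(x, lambda[2])
}
dpois_mix(0:3, p = 0.25, lambda = c(8, 16))
```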
image1
image2
image3
#-- Run this code to generate the data
set.seed(123) # set seed for reproducibility
n = 200 # sample size
z = sample(1:2, size=n, replace=TRUE, prob=c(.25, .75)) # sample the latent class
theta = c(8, 16) # true parameters
y = ifelse(z==1, rpois(n, lambda=theta[1]), rpois(n, lambda=theta[2]))
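As a numerical check on the EM updates derived above, here is a minimal hand-rolled EM loop for the two-component Poisson mixture (a sketch, not the course solution; the starting values `p` and `lam` are arbitrary choices of mine):

```r
# EM for a two-component Poisson mixture fit to the simulated y
poisson_mix_em = function(y, p = 0.5, lam = c(5, 15), tol = 1e-8, maxit = 500) {
  for (it in 1:maxit) {
    # E-step: posterior probability each observation came from component 1
    a = p * dpois(y, lam[1])
    b = (1 - p) * dpois(y, lam[2])
    r = a / (a + b)
    # M-step: weighted MLE updates for the mixing weight and the two rates
    p_new   = mean(r)
    lam_new = c(sum(r * y) / sum(r), sum((1 - r) * y) / sum(1 - r))
    done = max(abs(c(p_new - p, lam_new - lam))) < tol
    p = p_new; lam = lam_new
    if (done) break
  }
  list(pi = p, lambda = lam, iterations = it)
}
poisson_mix_em(y)
```

With the starting rates at (5, 15), component 1 should converge toward the smaller true rate and component 2 toward the larger, so the estimates can be compared against theta = c(8, 16).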
poisregmixEM() in the R package mixtools is designed to estimate a mixture of Poisson regression models. We can still use this function for our problem of pmf estimation if it is recast as an intercept-only regression. To do so, set the \(x\) argument (predictors) to x = rep(1, length(y)) and addintercept = FALSE.
Note: the beta values (regression coefficients) are on the log scale, so the estimated Poisson rates are recovered as exp(beta).
library(mixtools)
?poisregmixEM # open the help page
results=poisregmixEM(y=y,x=rep(1,length(y)),addintercept=FALSE)
#> number of iterations= 46
summary(results)
#> summary of poisregmixEM object:
#> comp 1 comp 2
#> lambda 0.728314 0.271686
#> beta1 2.781594 2.064644
#> loglik at estimate: -611.2
lambda.hat = exp(results$beta) # Poisson rates are exp(beta); results$lambda holds the mixing weights
x = 0:30
plot(x, results$lambda[1]*dpois(x, lambda.hat[1]) + results$lambda[2]*dpois(x, lambda.hat[2]),
     type="h", xlab="x", ylab="estimated mixture pmf")
The estimated rates are exp(2.7816) ≈ 16.2 and exp(2.0646) ≈ 7.9, with mixing proportions of roughly 0.73 and 0.27. These are close to the true values used to generate the data (lambda = 16 with probability 0.75, and lambda = 8 with probability 0.25), so the intercept-only poisregmixEM() fit recovers the mixture well.